Add mixed precision training to TorchEngine #1322

Merged: 1 commit merged into master from nick-add-amp on May 15, 2023
Conversation

@JackTemaki (Collaborator) commented May 4, 2023

Uses `torch_amp` as a config dict with a `dtype` option.

Adds a GradScaler to the engine, and applies autocast and the scaler during training if AMP is enabled. Uses `grad_scaler` as a config option to configure it explicitly.

Fixes #1334.

uses torch_amp_options as config dict with "dtype" option.
Adds GradScaler to engine, and applies autocast and the scaler during training if
amp is enabled.
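
For context, a minimal sketch of how autocast and a GradScaler are typically combined in a PyTorch training step. The model, optimizer, and the way the config dtype is read are placeholders, not the actual TorchEngine code:

```python
import torch

# Minimal AMP training-step sketch (placeholder model/optimizer, not TorchEngine).
amp_dtype = torch.float16  # e.g. derived from the config: torch_amp = {"dtype": "float16"}
model = torch.nn.Linear(128, 10).to("cuda")
optimizer = torch.optim.Adam(model.parameters())
# The scaler is only needed for float16; bfloat16 usually does not require it.
scaler = torch.cuda.amp.GradScaler(enabled=(amp_dtype == torch.float16))


def train_step(inputs, targets):
    optimizer.zero_grad()
    # Forward pass and loss run under autocast; eligible ops use amp_dtype.
    with torch.autocast(device_type="cuda", dtype=amp_dtype):
        logits = model(inputs)
        loss = torch.nn.functional.cross_entropy(logits, targets)
    # Scale the loss so that small float16 gradients do not underflow;
    # scaler.step() unscales the gradients before the optimizer update.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```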
JackTemaki requested review from a team and albertz as code owners on May 4, 2023 14:43
albertz requested a review from Icemole on May 4, 2023 15:04
@albertz (Member) left a comment

Looks fine.

@albertz (Member) commented May 4, 2023

I wonder in what cases you would want the model params to also use the same dtype. I have read that when using bfloat16, you usually want the params to be stored in bfloat16 as well.

With float16 AMP training, is it normal to keep the params in float32?

What about TensorFloat32?

So I wonder whether this is something that should also be handled automatically via torch_amp_options, or whether it should be a separate option. And I wonder how people usually do it.

@albertz (Member) commented May 15, 2023

I'll just merge this as an initial version now.

Can you comment on my questions?

albertz merged commit 003b2e8 into master on May 15, 2023
albertz deleted the nick-add-amp branch on May 15, 2023 07:35
@albertz (Member) commented May 15, 2023

Another question: was it intentional to allow the user to not specify dtype? I also wonder what exactly the behavior of autocast is if you use dtype=None.

@JackTemaki (Collaborator, Author) commented:

> With float16 AMP training, is it normal to keep the params in float32?

From the PyTorch docs (https://pytorch.org/docs/stable/amp.html#torch.autocast):

> When entering an autocast-enabled region, Tensors may be any type. You should not call half() or bfloat16() on your model(s) or inputs when using autocasting.

So yes, you should not do anything to the model explicitly because autocast is handling that.

> Another question: was it intentional to allow the user to not specify dtype? I also wonder what exactly the behavior of autocast is if you use dtype=None.

No, this is a mistake; it should be given. I do not know what the behavior is in that case, but it is likely not what is intended.

@albertz (Member) commented May 15, 2023

> So yes, you should not do anything to the model explicitly because autocast is handling that.

Autocast automatically casts the inputs to certain PyTorch ops. The parameters are not changed; they are just cast automatically for those ops. But this is not really my question. My question is: wouldn't it make more sense to directly have the parameters in float16?
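
A small demo of that behavior (a generic PyTorch example on a CUDA device, not TorchEngine code): the module parameters keep float32, and only the op executed inside the autocast region produces float16 outputs.

```python
import torch

layer = torch.nn.Linear(8, 8).to("cuda")
x = torch.randn(4, 8, device="cuda")

print(layer.weight.dtype)                 # torch.float32
with torch.autocast(device_type="cuda", dtype=torch.float16):
    y = layer(x)
    print(y.dtype)                        # torch.float16 (the matmul ran in float16)
print(layer.weight.dtype)                 # still torch.float32 after the region
```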

albertz added a commit that referenced this pull request May 15, 2023
Follow-up to #1322

Rename torch_amp_options to torch_amp.

Allow simply `torch_amp = 'float16'` in config.

Allow to specify grad_scaler separately.
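
To make this concrete, here is a hedged sketch of how these config variants might look in a RETURNN config file (which is a Python file). The exact set of accepted keys and grad_scaler values is an assumption based on the commit message, not verified against the code:

```python
# Sketch of the config variants described by this follow-up commit.

# Short form: only the autocast dtype.
torch_amp = "float16"

# Dict form with an explicit "dtype" entry (e.g. for bfloat16):
# torch_amp = {"dtype": "bfloat16"}

# grad_scaler can be configured separately; the options shown here are the
# arguments of torch.cuda.amp.GradScaler and are only illustrative.
# grad_scaler = {"init_scale": 2.0 ** 16, "growth_interval": 2000}
```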
@albertz (Member) commented May 15, 2023

I renamed torch_amp_options just to torch_amp.

I was even thinking about renaming it to just amp, or maybe autocast. We also don't have a torch_ prefix for other options, and you might want to use this for other backends as well. For example, jmp implements automatic mixed precision training for JAX.

@JackTemaki (Collaborator, Author) commented:

> So yes, you should not do anything to the model explicitly because autocast is handling that.

> Autocast automatically casts the inputs to certain PyTorch ops. The parameters are not changed; they are just cast automatically for those ops. But this is not really my question. My question is: wouldn't it make more sense to directly have the parameters in float16?

I see no indication why, unless you really want your whole network to run in float16.

@albertz (Member) commented May 15, 2023

> My question is: wouldn't it make more sense to directly have the parameters in float16?

> I see no indication why

Because that further reduces the memory requirement? Why would you not want that? What are the downsides?

I'm not saying that everything should be float16. Maybe certain ops must stay in float32. I thought this was the main aspect of autocast/AMP: to cast to float16 wherever it makes sense.

I just don't understand why the weights are stored in float32 and then always auto-cast. That also adds some overhead in computation (the casting) and requires more memory. Unless there is maybe some reason. But that is my question: what is the reason for this?

@albertz (Member) commented May 15, 2023

Ah, I was just checking the original paper introducing automatic mixed precision training ("Mixed Precision Training", Micikevicius et al., 2017), and it explains this (Sec. 3.1):

> In mixed precision training, weights, activations and gradients are stored as FP16. In order to match the accuracy of the FP32 networks, an FP32 master copy of weights is maintained and updated with the weight gradient during the optimizer step. In each iteration an FP16 copy of the master weights is used in the forward and backward pass. ...
>
> While the need for FP32 master weights is not universal, there are two possible reasons why a number of networks require it. One explanation is that updates (weight gradients multiplied by the learning rate) become too small to be represented in FP16 - any value whose magnitude is smaller than $2^{-24}$ becomes zero in FP16. ...
>
> Another explanation is that the ratio of the weight value to the weight update is very large. In this case, even though the weight update is representable in FP16, it could still become zero when addition operation right-shifts it to align the binary point with the weight. ...
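
For illustration, both effects can be reproduced directly in PyTorch (a generic numeric example, not from the PR):

```python
import torch

# 1) Updates with magnitude below 2**-24 underflow to zero in float16.
tiny_update = torch.tensor(2.0 ** -27, dtype=torch.float16)
print(tiny_update)                     # tensor(0., dtype=torch.float16)

# 2) Even a representable update can vanish when added to a much larger weight,
#    because the addition right-shifts the small operand to align it.
weight = torch.tensor(1.0, dtype=torch.float16)
update = torch.tensor(2.0 ** -12, dtype=torch.float16)  # representable in FP16
print(weight + update)                 # tensor(1., dtype=torch.float16)

# Keeping a float32 master copy avoids both problems during the optimizer step.
print(weight.float() + update.float())  # tensor(1.0002)
```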

JackTemaki added a commit to JackTemaki/MiniReturnn that referenced this pull request May 16, 2023
uses torch_amp_options as config dict with "dtype" option.
Adds GradScaler to engine, and applies autocast and the scaler during training if
amp is enabled.
@JackTemaki (Collaborator, Author) commented:

Now that the code is longer, we might want to move this into the updater or add an extra module instead of having it plain in the engine.

Linked issue: PyTorch automatic mixed precision (AMP) support